Modified Approach of Multinomial Naïve Bayes for Text Document Classification

Author

  • S. W. Mohod
Abstract

This work proposes a modified Multinomial Naïve Bayes approach to text classification for identifying documents and assigning them to a particular category. The motivation is the rapid growth of textual information in electronic documents and on the World Wide Web. The Naïve Bayes theorem is effective for classifying text documents into predefined categories by means of probabilistic values. However, its performance is frequently degraded by inappropriate feature selection. The aim of this paper is to propose a method that improves classification accuracy. In addition, a new feature selection method for text document classification in machine learning is proposed. In machine learning, a training set is generated and then used when classifying the test documents. A scoring method is used to enhance the efficiency of classification with respect to accuracy and performance.

Keywords: Text classification, Naïve Bayes, Feature selection.

Introduction

With the increasing availability of electronic documents and the rapid growth of the World Wide Web and of data in digital format, automatic document classification has become an important task for organizations. Proper classification of electronic documents, online news, blogs, e-mails and digital libraries requires text mining, machine learning and natural language processing techniques to extract the required knowledge. Text mining attempts to discover interesting information and knowledge from unstructured documents. The important task is to develop an automatic classifier that maximizes accuracy and efficiency when classifying existing and incoming documents. A large volume of an organization's information is stored in unstructured form as reports, messages, news and e-mail [1]. However, data mining deals with structured data, whereas text has special characteristics and is unstructured.
The central question is how such documents can be properly retrieved, presented and classified. This is difficult for a machine, so there has been growing interest in this area of research [2]. Extraction, integration and classification of text documents from different sources, together with knowledge discovery that finds features in the available documents, are important. In data mining, machine learning is often used for prediction or classification. Classification involves finding a rule that partitions the data into disjoint groups. The input to classification is the training data set, whose class labels are already known. A classifier analyzes the training data set and constructs a model based on the class labels. The goal of classification is to build a set of models that can correctly predict the class of different objects.

Machine learning is an area of artificial intelligence concerned with the development of techniques that allow computers to "learn". More specifically, machine learning is a method for creating computer programs through the analysis of data sets. Text classification is challenging because many problems arise from the high dimensionality of the feature space and from the unordered collection of words in text documents. Various machine learning algorithms are available and are used in document classification. Naïve Bayes is one of the popular machine learning algorithms because of its simplicity [2][3][4]; it is easy to implement and achieves better accuracy on large datasets [5]. The Naïve Bayes classifier performs well in classification tasks, with probabilities calculated under the Naïve Bayes independence assumption [6][7].
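To make the classification step described above more concrete, the following is a minimal sketch of the standard multinomial Naïve Bayes classifier, not the modified approach this paper proposes. The function names, the add-one (Laplace) smoothing parameter `alpha`, and the idea of passing documents as pre-tokenized word lists are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit a standard multinomial Naive Bayes model.

    docs   : list of token lists (tokenization is assumed to be done already)
    labels : list of class labels, one per document
    alpha  : smoothing constant (assumption: add-one smoothing by default)
    """
    vocab = {tok for doc in docs for tok in doc}
    class_doc_counts = Counter(labels)

    # Count how often each term occurs in the documents of each class.
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)

    n_docs = len(docs)
    log_prior = {c: math.log(n / n_docs) for c, n in class_doc_counts.items()}

    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + alpha) / total) for t in vocab
        }
    return vocab, log_prior, log_likelihood


def predict(model, doc):
    """Return the class with the highest log posterior for a token list."""
    vocab, log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Independence assumption: per-term log likelihoods are simply summed.
        # Terms never seen in training are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in doc if t in vocab
        )
    return max(scores, key=scores.get)
```

With a toy training set of two "sport" documents and two "politics" documents, `predict(model, ["goal", "match"])` would pick the class whose training documents contain those words more often. Working in log space avoids numerical underflow when documents contain many terms.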
The paper mainly focuses on reducing the number of features in a class-dependent manner, using text document feature selection with a new feature scoring method, and on using the proposed feature selection method to implement a modified Multinomial Naïve Bayes classification model for classifying test documents. Thousands of terms occur in a text document, so it is important to reduce the dimensionality of the feature space through feature selection [8]. To address this problem, many feature evaluation metrics have been explored, such as the χ² statistic (CHI), Information Gain (IG), mutual information, term strength, document frequency, and Term Frequency-Inverse Document Frequency. With the help of these approaches it is possible to reduce the high dimensionality of the features. The proposed feature scoring metric is put forward as the most effective method for reducing the dimensionality of the features and improving the efficiency and accuracy of the classifier. In this approach, document preprocessing is also important for reducing the complexity and high dimensionality of the features that occur in text documents.

Vol 6 • Number 2 April Sep 2015 pp. 196-200 Impact Factor: 2.5 Available at www.csjournals.com DOI: 10.090592/IJCSC.2015.614

Feature Selection Approaches

Feature selection helps to improve efficiency and accuracy in text classification. In our approach we examine different feature selection methods and then determine whether the proposed method is effective compared with the other methods studied.

A. TF (Term Frequency)

The term frequency in a given document is simply the number of times a given term appears in that document. TF is used to measure the importance of a term in a document, based on the number of occurrences of each term in that document. Every document is described as a vector consisting of words.
TF expresses the importance of the term 't' within a particular document as $\mathrm{TF}(t) = n_i / \sum_k n_k$, with 'n_i' being the number of occurrences of the considered term and the denominator being the number of occurrences of all terms in the document.
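The term frequency definition above can be sketched in a few lines. The function name `term_frequency` and the whitespace tokenization in the usage example are assumptions made for illustration; the paper's own preprocessing is not shown in this excerpt.

```python
from collections import Counter


def term_frequency(tokens):
    """Map each term to (occurrences of the term) / (occurrences of all terms)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        # An empty document has no terms, hence no frequencies.
        return {}
    return {term: n / total for term, n in counts.items()}


# Assumption: simple whitespace tokenization of an already lower-cased text.
doc = "naive bayes is a simple classifier and naive bayes is fast".split()
tf = term_frequency(doc)
```

Here the document has 11 tokens and "naive" occurs twice, so `tf["naive"]` is 2/11. Because every value is divided by the same total, the frequencies of a document sum to 1.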

Similar resources

A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier

With the rapid increase in the number of documents, the use of Text Document Classification (TDC) methods has become crucial. This paper presents a hybrid model of Invasive Weed Optimization (IWO) and the Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS), in order to reduce the large size of the feature space in TDC. TDC includes different actions such as text processing, feature extraction, form...

Improving on the Naïve Bayes Document Classifier

The Naïve Bayes document classifier has been used in many document classification algorithms [1], but is only really useful on a small subset of documents due to its many shortcomings [2]. By augmenting the basic functionality of the simple Naïve Bayes classifier, the classification algorithm can be applied to a much wider range of documents. This paper investigates the advantages which can be...

Classification Using Naïve Bayes- a Survey

Classification, particularly Text Classification, is a supervised learning approach categorizing into various categories, the available training set of correctly identified observations analyzed into a set of features. There are many phases involved in classification. The main classification phase involves the use of classification algorithms or classifiers. Among the various classifiers, the N...

Is Naïve Bayes a Good Classifier for Document Classification?

Document classification is of growing interest in text mining research. Correctly assigning documents to a particular category still presents a challenge because of the large number of features in the dataset. Among the existing classification approaches, Naïve Bayes is potentially good at serving as a document classification model due to its simplicity. The aim of this...

Or gate Bayesian networks for text classification: A discriminative alternative approach to multinomial naive Bayes

We propose a simple Bayesian network-based text classifier, which may be considered as a discriminative counterpart of the generative multinomial naive Bayes classifier. The method relies on the use of a fixed network topology with the arcs going from term nodes to class nodes, and also on a network parametrization based on noisy or gates. Comparative experiments of the proposed method with nai...

Publication date: 2015